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Abstract 

Quantifying the proportion of polymorphic mutations that are deleterious or neutral is of fundamental 
importance to our understanding of evolution, disease genetics and the maintenance of variation genome- 
wide. Here, we develop an approximation to the distribution of fitness effects (DFE) of segregating single- 
nucleotide mutations in humans. Unlike previous methods, we do not assume that synonymous mutations 
are neutral or not strongly selected, and we do not rely on fitting the DFE of all new nonsynonymous 
mutations to a single probability distribution, which is poorly motivated on a biological level. We rely on 
a previously developed method that utilizes a variety of published annotations (including conservation 
scores, protein deleteriousness estimates and regulatory data) to score all mutations in the human genome 
based on how likely they are to be affected by negative selection, controlling for mutation rate. We map 
this score to a scale of fitness coefficients via maximum likelihood using diffusion theory and a Poisson 
random field model on SNP data. Our method serves to approximate the deleterious DFE of mutations 
that are segregating, regardless of their genomic consequence. We can then compare the proportion 
of mutations that are negatively selected or neutral across various categories, including different types 
of regulatory sites. We observe that the distribution of intergenic polymorphisms is highly peaked at 
neutrality, while the distribution of nonsynonymous polymorphisms is bimodal, with a neutral peak and 
a second peak at s ss — 10~ 4 . Other types of polymorphisms have shapes that fall roughly in between 
these two. We find that transcriptional start sites, strong CTCF-enriched elements and enhancers are 
the regulatory categories with the largest proportion of deleterious polymorphisms. 
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Author Summary 

The relative frequencies of polymorphic mutations that are deleterious, nearly neutral and neutral is 
traditionally called the distribution of fitness effects (DFE). Obtaining an accurate approximation to 
this distribution in humans can help us understand the nature of disease and the mechanisms by which 
variation is maintained in the genome. Previous methods to approximate this distribution have relied 
on fitting the DFE of new mutations to a single probability distribution, like a normal or an exponential 
distribution. Generally, these methods also assume that a particular category of mutations, like synony- 
mous changes, can be assumed to be neutral or nearly neutral. Here, we provide a novel method designed 
to reflect the strength of negative selection operating on any segregating site in the human genome. We 
use a maximum likelihood mapping approach to fit these scores to a scale of neutral and negative fitness 
coefficients. Finally, we compare the shape of the DFEs we obtain from this mapping for different types 
of functional categories. We observe the distribution of polymorphisms has a strong peak at neutrality, 
as well as a second peak of deleterious effects when restricting to nonsynonymous polymorphisms. 

Introduction 

Genetic variation within species is shaped by a variety of evolutionary processes, including mutation, 
demography, and natural selection. With the advent of whole-genome sequencing, we can make unprece- 
dented inferences about these and other processes by analyzing population genomic data. An important 
goal is to understand the extent to which segregating genetic variants are impacted by natural selection, 
and to quantify the intensity of natural selection acting genome-wide. Understanding the prevalence of 
different modes of selection on a genomic scale has wide-ranging implications across evolutionary and 
medical genetics. For instance, genome-wide association studies (GWAS) are searching for mutations 
associated with disease in large samples of humans [1]. Because mutations associated with disease are a 
priori likely to be deleterious, quantifying the portion of mutations that are deleterious along with their 
average effects could have significant implications for the design and interpretation of GWAS. Moreover, 
the ENCODE project has recently claimed that much of the genome is involved in some form of func- 
tional activity [2,3]. However, the extent to which molecular signals identified by this project are actually 
produced by biological processes affecting fitness has been disputed [4,5]. Indeed, comparative genomics 
studies suggest that only a small proportion of the human genome (5 — 10%) is under purifying selection, 
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based on signals detectable on phylogenetic timescales [6-8]. Quantifying the DFE in noncoding regions 
is a first step toward understanding the fitness implications of functional activity at the genomic level. 

Traditionally, studies have sought to estimate the distribution of fitness effects (DFE) for nonsyn- 
onymous mutations by using summary statistics based on the number of polymorphisms and substitu- 
tions [9-11] and/or the full frequency spectrum [12-14]. These studies typically assumed that synonymous 
variation is neutral or under weak selection. Many of these analyses suggest that while a large proportion 
of nonsynonymous mutations are nearly neutral, there is also a significant probability that an amino acid 
changing mutation will be strongly deleterious. While these studies were limited to analysis of protein- 
coding genes, recent work has focused on quantifying the DFE in regulatory regions, including short 
interspersed genomic elements such as enhancers [15,16] and cis-regulatory regions [17]. Reviews of these 
and other approaches can be found in ref. [18,19]. 

There are several obstacles to quantifying the DFE of new or segregating mutations genome- wide. 
First, inferences about the DFE are confounded by demography [20]. For example, a high proportion 
of low frequency derived alleles is a signature of negative selection, but can also be caused by recent 
population growth [21]. Hence, a well-supported demographic model must be used to appropriately 
control for population history when inferring the DFE. Second, most current methods rely on dividing 
up polymorphisms into either putatively selected sites or putatively neutral (or less selected) sites (for 
example, nonsynonymous and synonymous sites, respectively). These studies have relied on fitting a 
demographic moel to the neutral class of sites and then fitting the DFE of new mutations to a probability 
distribution, typically an exponential or gamma distribution [9, 13] to the class of sites that are putatively 
under selection (e.g. nonsynonymous sites). While flexible, these distributions may miss some important 
features of the DFE [22]. For example, mutation accumulation experiments suggest that the DFE may be 
bimodal for at least some species, with most mutations either having nearly neutral or strongly deleterious 
effects, and very few mutations falling in between [23,24]. Thus, assuming two classes of sites may not 
serve to capture all the relevant information about the DFE (but see [25] for an example of fitting a 
multimodal DFE to population genetic data and [22,26] for nonparametric approaches to estimating the 
DFE of new amino-acid changing mutations). Finally, previous studies have been restricted to analyzing 
specific subclasses of mutations (e.g. nonsynonymous, enhancers, etc.) because until recently, no single 
metric existed that could serve to compare the disruptive potential of any type of variant, regardless of 
its genomic consequence. 
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83 Recently, Kircher et al. [27] developed a method to synthesize a large number of annotations into a 

84 single score to predict the pathogenicity or disruptive potential of any mutation in the genome. It is 
as based on an analysis comparing real and simulated changes that occurred in the human lineage since 
se the human-chimpanzee ancestor, and that are now fixed in present-day humans. The method relies 
a? on the realistic assumption that the set of real changes is depleted of deleterious variation due to the 
88 action of negative selection, which has pruned away disruptive variants, while the simulated set is not 
so depleted of such variation. A support vector machine (SVM) was trained to distinguish the real from the 

90 simulated changes using a kernel of 63 annotations (including conservation scores, regulatory data and 

91 protein deleteriousness scores) , and then used to assign a score (C-score) to all possible single- nucleotide 

92 changes in the human genome, controlling for local variation in mutation rates. These C-scores are meant 

93 to be predictors of how disruptive a given change may be, and are comparable across all types of sites 
9i (nonsynonymous, synonymous, regulatory, intronic or intergenic). Thus, they allow for a strict ranking 

95 of predicted functional disruption for mutations that may not be otherwise comparable. C-scores are 

96 PHRED scaled, with larger values corresponding to more disruptive effects. 

97 Importantly, human-specific genetic variation patterns are not used as input to train the C-score SVM. 

98 In this work, we make use of the C-scores to provide a fine-grained stratification of the deleteriousness 

99 of variants segregating in modern human populations. We take advantage of the Poisson random field 

100 model [28,29] with a realistic model of human demographic history to fit a maximum likelihood selection 

101 coefficient for each C-score, creating a mapping from C-scores to selection coefficients. 

102 Results 

103 A mapping from C-scores to selection coefficients 

ioi To map C-scores to selective coefficients, we obtained allele frequency information from 9 Yoruba (YRI) 

105 individuals (18 haploid sequences) sequenced to high-coverage using whole-genome shotgun sequencing 

106 as part of a dataset produced by Complete Genomics (CG) [30]. We removed sites that had a Duke 

107 Uniqueness 20bp-mapability score < 1 (downloaded from the UCSC Genome Browser, [31]), to avoid 
los potential errors due to mismapping or miscalling in regions of the genome that are not uniquely mapable. 
109 When inferring the DFE, we focused only on models of neutral evolution and negative selection, 
no because C-scores are uninformative about adaptive vs. deleterious disruption (i.e. a high C-score could 
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1 either reflect a highly deleterious change or a highly adaptive change) . Additionally, because we are using 

2 polymorphism data only, positive selection should contribute little to the site-frequency spectrum [32] . 
We first binned polymorphisms into C-scores rounded to the nearest integer and computed the site 

i frequency spectrum for each bin (Figure SI). We then fit the lowest possible C-score (C= 0), presumed 

s to be neutral, to different models of demographic history. We computed the likelihood of the SFS in 

e this bin for a constant population size model, a range of exponential growth models, the model inferred 

7 by Tennessen et al. [33] and the model inferred by Harris and Nielsen [34] from the distribution of 

8 tracts of identity by state (IBS) (Figure S2), and used an EM algorithm to correct for ancestral state 

9 misidentification (Figure S3, see Materials and Methods). We find that a model of exponential growth 

20 at population-scaled rate = 1 for 13,000 generations is the best fit to the corrected SFS, although the 

21 Tennessen model is almost as good a fit (Figure S2) . 

22 Using the best-fitting demography, we next fit a range of models with different selection coefficients 

23 to the EM-corrected site frequency spectrum for each C-score bin, using maximum likelihood (Figure 

24 l.A) (see Methods). We restricted to C < 40, because very few sites have larger C-scores, and hence 

25 estimates of the selection coefficients for those C-scores are unreliable. We tested the robustness of the 

26 mappings to different levels of background selection, by partitioning the data into deciles of B-scores [35] 

27 and re-computing the C-to-s mapping for each decile. We observe that the mapping is generally robust 
2s to background selection, with substantial differences only observed at the lowest two B-score quantiles, 

29 which correspond to high background selection (Figure S4). For this reason, and so as to obtain reliable 

30 DFEs at exonic sites (where background selection is generally higher than in the rest of the genome) , we 

31 also performed a neutral demographic fitting and a C-to-s mapping while restricting only to sites in the 

32 exome (Figure l.C). This mapping has a steeper decline than the genomic mapping, reflecting patterns 

33 of background selection which are not fully controlled by C-scores but that affects the SFS. We therefore 
3i show estimated DFEs using both the genome-wide and the exome-wide fittings below. After removing 

35 the C-score bins that best fit the neutral model, we fit a smoothing spline with 20 degrees of freedom 

36 to the remaining C-scores, so as to produce a continuous mapping of C-scores to selection coefficients 

37 (Figure l.A). 

38 We were concerned that our binning-based mapping would miss important features about the dis- 

39 tribution of coefficients within each bin. To address this, we also fitted individual gamma distributions 

40 of selection coefficients to each of the bins. We show the mean, standard deviation (SD) and ancestral 
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141 misidentification rate of each gamma fitting in Figure S3. The shape of the fitted gammas vary from 

142 an L-shape (Mean/SD < 1) at low C bins, to a shape resembling a skewed normal distribution at in- 

143 termediate C bins (Mean/SD > 1) to a shape resembling an exponential distribution at high C bins 

144 (Mean/SD w 1) (Figure l.D). We performed a likelihood ratio test comparing the gamma model to the 

145 single-coefficient model, and find that only 4 out of the 40 bins (containing only 0.5% of all polymor- 

146 phisms and 4.7% of nonsynonymous polymorphisms) are significantly supportive of the gamma model 

147 (Figure l.E). A chi-squared test of the fit to the data shows both models perform similarly well, though 
i4s both result in significant chi-squared scores at low C-score bins when using the genome-wide data (Figure 

149 l.F). We attribute this to the large amount of data present in those bins as well as complex details 

150 of demographic history that affect neutral sites but that we did not model in our neutral fitting. In 

151 contrast, when mapping using only the exome, almost all bins have non-significant statistics, suggesting 

152 that selection dominates over demography in these regions. Based on these results, we decided to use 

153 the smoothed single-coefficient fitting for estimating the DFE in most downstream analyses, although we 

154 may be missing a small proportion of within-bin variability. Additionally, we show the inferred DFE of 

155 all, nonsynonymous and synonymous polymorphisms obtained from the gamma-fitted mapping in Figure 

156 S5. 

157 We aimed to test the robustness of the selection coefficient estimates within each bin. We were specif- 
i5s ically concerned about highly deleterious bins, which are composed of a smaller number of segregating 

159 sites than neutral or nearly neutral bins, and could produce unstable or biased estimates. We obtained 

160 bootstrapped confidence intervals for each bin and observe that the mappings are relatively stable up to 

161 C= 35 (Figure I. A). As expected, the standard deviation of the bootstrap estimates is strongly negatively 

162 correlated with the sample-size per bin (Figure S6, Pearson correlation coefficient = -0.89). Thus, most 

163 of the increase in the width of the confidence intervals observed at higher C-score bins can be explained 

164 by the small number of polymorphisms available in those bins, and is likely not the result of other un- 

165 accounted processes, such as positive selection, operating exclusively on highly scored polymorphisms, 
lee To verify that our mapping was not an artifact of the different number of C-scores within each bin, we 
167 also performed 100 randomizations of the C-score assignments to each SNP in the genome. For each 
lea randomization, we re-computed the C-to-s mapping. When doing so, the bootstrap confidence intervals 

169 increase in size with increasing C scores, but the mapping remains flat, as expected (Figure l.B). 

170 Further, we verified that the mapping did not change considerably when filtering for sites in regions 
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171 with low CpG density (< 0.05), denned as the proportion of CpG dinucleotides in a window of +/- 

172 75bp around the site [27] (Figure S7.A). This is expected, as the C-score model accounts for differential 

173 mutation rates at CpG sites and the resulting scores are generally robust to them [27]. As before, the 

174 gamma model is a significantly better fit than the single-coefficient model at only 4 out of the 40 bins 

175 (Figure S7.B). 

176 Additionally, we re-mapped the scores using a constant-size model, and verified that the mapping 

177 does not change considerably if we assume a different demographic history than the best fit (Figure S8). 
i7s The mappings are highly similar in shape, with the exception that, because the constant-size model is 
179 depleted of singletons relative to the best-fit model, it takes more bins to reach an SFS that is deleterious 
lso enough to map to s ^ 0, and so more C-scores map to s= 0. 

lsi To test the dependence of our mapping on the choice of score used to estimate selection coefficients, 

182 we performed the same fitting procedure on a variety of other conservation and deleteriousness scores (see 

183 Methods). We note, however, that all of these scores are included as input in the C-score SVM. Figure 

184 S9 shows that the shape of the mapping is fairly consistent across different choices of scores, except for 
las highly deleterious bins, which contain very few sites. When comparing different categories of sites in the 
186 Results, we show their distribution of selection coefficients obtained from the C-score mapping, as this 
is? score has been shown to be a better correlate to functional disruption than all the other scores mentioned 
las above, and also controls for mutation rate variation across the genome, while other scores do not [27]. 

189 Additionally, we plot the mapped density of selection coefficients (with smoothing bandwidth = 0.000005) 

190 for all polymorphisms as well as synonymous and nonsynonymous polymorphisms using each of the other 

191 scores in Figure Sf 0. We observe that, while all scores easily distinguish genie sites, PhastCons scores 

192 have difficulty distinguishing between synonymous and nonsynonymous sites, which we believe is due to 

193 PhastCons scores being regional, rather than position-specific scores. Additionally, while bimodality at 

194 genie sites is most prominent when using C-scores, it also becomes apparent in other position-specific 

195 scores when plotting the density with a finer smoothing bandwidth (= 0.000001, Figure Sll). 

196 The distribution of fitness effects of segregating mutations 

197 Using the C-score-to-selection coefficient mapping, we obtained the DFE of segregating polymorphisms 

198 in Yoruba individuals. This distribution is highly peaked when all polymorphisms are considered (Figure 

199 2, black dashed line), with a large proportion of neutral changes and a long tail of deleterious mutations, 
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200 as has been observed before when estimating the DFE of coding sequences [9, 11-13,20]. Interestingly, 

201 we observe a pronounced drop in frequency for values of s < — 10 . We note that this is not due to our 

202 capping our mapping at C — 40 as the selection coefficients we are able to map are of a greater magnitude 

203 than this drop (Figure 1, S12). 

204 We partition the data by the genomic consequence of the polymorphisms, using the Ensembl Variant 

205 Effect Predictor (v. 2. 5) [36]. Some classes exhibit a peak of highly deleterious changes around s = 

206 — 10~ 4 . This peak results in a bimodal distribution that is especially pronounced for nonsynonymous 

207 sites (Figure 2, red line), and is almost non-existent for intergenic sites (Figure 2, pink line). Other types 
2os of polymorphisms — like splice site, synonymous, 3' UTR, 5' UTR and regulatory mutations — have less 

209 deleterious peaks than the one observed at nonsynonymous polymorphisms (Figure 2). We can compare 

210 the selection coefficient distributions to the distributions of unmapped C-scores (Figure S12) which are 

211 much less tightly peaked at intermediate C-score values and do not show a sharp decrease in density for 

212 high values, as does the s distribution in Figure 2. We show various statistics calculated on each of the 

213 selection coefficient distributions in Table 1 with the genome- wide mapping and in Table SI with the 

214 exome-wide mapping. 

215 Next, we partitioned the data by whether the polymorphisms were found in the GWAS database [37] 

216 or not (Figure S13, Tables 1, SI). While we observe a second deleterious peak among the GWAS SNPs 

217 as well, these SNPs seem to be highly enriched for neutral polymorphisms. 

2is Finally, we classified polymorphisms by different regulatory categories. We used two regulatory tracks 

219 downloaded from the UCSC Genome Browser [31,38]. First, we partitioned the genome by regulatory 

220 regions identified by Regulome DB [39], which predicts regulatory activity in noncoding regions based 

221 on different types of experimental evidence (Figure S14, Tables 1, SI). Second, we used the Segway 

222 regulatory segment tracks [40], which are the product of an unsupervised pattern discovery algorithm 

223 that serves to identify and label regulatory regions along the genome, including genie regions (Figure 3, 

224 Tables 1, SI). 

225 Discussion 

226 The distribution of fitness effects (DFE) describes the proportion of mutations with given selection 

227 coefficients. Knowledge of the DFE has profound implications for our understanding of evolution and 
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228 health. We infer a highly peaked distribution for all polymorphisms, with a drop in density at s ~ — 10~ 4 , 

229 which may be the cutoff between weakly deleterious mutations that segregate in human populations and 

230 highly deleterious mutations that are easily pruned away by negative selection. 

231 Our inferred non-synonymous distribution is bimodal and looks very similar to the one obtained for 

232 nonsynonymous mutations in Drosophila in ref. [11], with a peak at neutrality and another peak at s « 

233 — 1CP 4 . Several experimental studies have also shown that non-synonymous non-lethal mutations tend to 

234 have a multimodal DFE in model organisms [41,42] (see ref. [18] for a comprehensive review). We note that 

235 it is impossible to obtain such kinds of distributions using a gamma or lognormal probability distribution 

236 unless one approximates bimodality by assuming a second, separate class of nonsynonymous mutations 

237 that arc completely neutral and do not follow the best-fitting probability distribution [11, 13,20,25]. 

23s We also tested the precision of the C-scores by fitting gamma distributed DFEs to each C-score bin. 

239 While only very few bins were fit by a highly peaked gamma distribution (Figure ID), the increased 

240 variation offered by the gamma distribution rarely improved the likelihood significantly (Figure IE). 

241 This suggests that the C-scores are precise quantifications of negative selection, and that mutations with 

242 similar C-scores are likely to have similar selection coefficients. 

243 Interestingly, we found that for many low C-score bins, a chi-squared test would reject the fit of the 

244 demographic model to the data. This is likely because these low C-score bins have a significant number 

245 of sites, and hence subtle features of the demography and quality control are relevant. Nonetheless, we 

246 note that for moderate-to-high C-score bins and for exonic data, we would not be able to reject the fit 

247 of the the predicted site frequency spectrum from the data. While these bins have fewer sites, it is also 

248 likely that stronger selection is obscuring some of the signal of subtle demographic events. 

249 Our novel expectation-maximization approach to quantifying ancestral state misidentification allows 

250 us to assess differential misidentification rates across C-score categories. Ancestral state misidentification 

251 occurs because a site is hit by more than one mutation, hence obscuring the identity of the ancestral 

252 allele. Here, we found that low C-score bins are enriched with ancestral state misidentification, with rates 

253 in excess of 5% for some bins (Figure S3). This may reflect a bias induced by the C-scores themselves, 

254 because C-scores are trained to distinguish classes of sites that have relatively few substitutions between 

255 humans and chimpanzees. Because the signal of ancestral state misidentification and positive selection 

256 are very similar [43], it is possible that low C-score bins are enriched for positive selection, although we 

257 did not pursue that direction any further. For larger C-score bins, we infer misidentification rates along 
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258 the lines of those obtained in simulation studies by ref. [43]. 

259 Importantly, unlike previous studies, we also obtain DFEs for other types of mutations, including 

260 synonymous, splice site, 3' UTR, 5' UTR and regulatory polymorphisms, which exhibit bimodality to 

261 a lesser degree than the nonsynonymous DFE. In particular, 5' UTR changes constitute the category 

262 with the smallest proportion of neutral or nearly neutral (|s| < 10~ J ) polymorphisms after nonsynony- 

263 mous changes, likely reflecting selection on gene regulation upstream of coding sequences. Futhermorc, 

264 distributions corresponding to mutations in UTR and regulatory regions have a less pronounced trough 

265 between the two peaks than the ones observed among coding mutations, suggesting that the magnitude 

266 of deleterious effects is more uniformly distributed in non-coding regions. In contrast, missense mutations 

267 appear to have more of an "all-or-nothing" effect, as would perhaps be expected when replacing an amino 

268 acid inside a protein. 

269 Our method does not use synonymous sites as a neutral benchmark, as do other studies [9,11,20]. In 

270 fact, our inferred DFE suggests that a considerable number of synonymous polymorphisms may not be 

271 neutral, as we observe a second deleterious peak in them too, albeit less deleterious than the one observed 

272 at nonsynonymous polymorphisms. We caution, however, that this second peak is less promient when 

273 using an exome-specific mapping (Figure 2) or when using strictly position-specific scores (Figures S10, 

274 Sll), which suggests that at least part of this peak may be caused by regional patterns of conservation or 

275 background selection in the exomes. Instead, intergenic polymorphisms are the class of sites most likely to 

276 evolve neutrally. Because this class is so abundant, most of the signal observed when all polymorphisms 

277 are pooled together closely reflects the distribution observed for intergenic polymorphisms (Figure 2). 

278 Our results have implications for GWAS, as we find a high proportion of GWAS SNPs to be neutral 

279 or nearly neutral, which could suggest a high rate of false positives in this type of association studies. 

280 On the other hand, GWAS studies only aim to find polymorphisms linked to causative variants, and so 

281 GWAS SNPs need not have strongly deleterious effects. Alternatively, if the effect size of many GWAS 

282 SNPs are sufficiently small, it is possible that many of them are not subject to strong selection. 

283 Additionally, by stratifying our results based on different ENCODE categories, we can elucidate the 

284 fitness consequences of molecular activity signals detected by ENCODE [2,3,39]. We find the category 

285 with the lowest proportion of neutral polymorphisms to be the one corresponding to sites that have eQTL 

286 evidence as well as evidence for transcription factor (TF) binding, a matched TF motif, a matched DNase 

287 footprint and that are located in a DNase peak. In general, categories that combine many regulatory 
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28s signals tend to show lower proportions of neutral mutations than those that do not, suggesting that data 

289 integration across distinct approaches to detecting selection and functionality is likely to do better than 

290 any individual approach [44]. Moreover, this suggests that much of the molecular activity detected by 

291 ENCODE may not have significant fitness consequences. 

292 Stratification by Segway regions allows us to look at a different aspect of regulatory activity in the 

293 genome. Interestingly, we observe that polymorphisms in Transcription Start Sites (TSS) are the ones 

294 containing the largest proportion of deleteriousness. This echoes results from analyses of variation at 

295 transcription factor binding sites, which have been found to be under stronger constraint when found 

296 near TSS than when found far from them [45]. Other regions with high proportions of deleterious 

297 polymorphisms include Gene Body (Start), strong CTCF and Enhancer / Gene Middle. This suggests 

298 that regions with strong repressor, insulator or enhancer activity, as well as near the start of genes, 

299 are particularly important for biological function, perhaps unsurprisingly given our knowledge of the 

300 molecular role that these regions play in the regulation of transcription. 

301 There are several limitations to our method. First, we have restricted ourselves to estimating the 

302 DFE of segregating mutations that have reached appreciable frequencies in the population. An extension 

303 of this approach would be to infer the DFE of new mutations from the DFE of segregating mutations 

304 genome-wide. Second, we assumed no dominance, epistasis or positive selection, which future studies 

305 could attempt to incorporate into our approach. In addition, we have assumed sites are independent and 

306 have therefore ignored the covariance between linked sites, which likely leads to an underestimatation 

307 of confidence intervals obtained from the bootstrapping. The free-recombination assumption may also 

308 affect inference due to Hill-Robertson interference between mutations subject to selection [46] as well 

309 as linked background selection affecting the SFS of neutral sites in the human genome [35] . This may 

310 be a more important issue in our case than other genic-only approaches because we are also including 
3n intergenic mutations in our analysis, so the space between analyzed polymorphisms is on average smaller 

312 than if we were only looking at coding polymorphisms [20]. We also assume no positive selection. This, 

313 however, should not be a major problem, because we are only basing our inferences on polymorphic 

314 sites and advantageous mutations contribute little to polymorphism, assuming N e s > 25 [32]. One final 

315 limitation is that the type of inference performed here is only possible in species in which C-scores have 

316 been estimated (for now, humans only) and that it therefore relies on C-scores being able to correctly 

317 rank sites by deleteriousness. Nevertheless, it should not be hard to obtain accurate C-scores for other 
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3is organisms in the future, although limitations on available annotations for non-human organisms may 
319 make the approximation to the fitness distribution less accurate than in humans. 



320 Materials and Methods 

321 Computing the theoretical site frequency spectrum 

322 We used the theory developed by Evans et al. [47] to obtain the expected population site frequency 

323 spectrum with non-equilibrium demography. We assume a Wright-Fisher population in the limit of large 

324 population size and use diffusion theory to model this process. Writing f(x, t) for the frequency spectrum 

325 at frequency x and time t = 2N(0)t where r is in units of generations and g(x,t) :— x(l — x)f{x,t), 

326 we can approximate the dynamics of g(x,t) with genie selection and mutation by solving the following 

327 partial differential equation: 

|. 9 (M) = -Sx(l - *)±Wx,t)] + f^^^t)] (1) 

328 subject to the boundary condition: 



]im g(x,t) = dp(t) (2) 

x— >0 

329 where S is the population-scaled selection coefficient (S — 2N(0)s), 9 is the population-scaled mutation 

330 rate (6 — 4N(0)p) and p(t) = N(t)/N(0) is the effective population size at time t relative to the population 

331 size at time 0. 

332 We assume N(0) to be 10,000 for all exponential and constant models. For the constant population size 

333 model, p(t) = 1. For the exponential growth model pit) = exp(rt) where r — 2N(Q)R is the population- 

334 scaled growth rate and R is the per-generation growth rate. For models taken from the literature, we use 

335 the N(0) assumed by the corresponding paper. For the model of Harris and Nielsen, pit) is piece-wise 

336 defined according to their Figure 7. The Tennessen model is similarly defined in a piece-wise fashion 

337 according to their Figure 2, although it also includes periods of exponential growth, as opposed to simply 

338 being piece-wise constant as in the Harris and Nielsen model. 

339 We solve for gix,t) numerically in Mathematics, and can then compute the expected number of 

340 segregating sites with i copies of the derived allele out of a sample of n genes. Denoting by g s ix,t) the 
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34i theoretical SFS with selection coefficient s, this quantity is 



oo / />1 

U,i(t) = J U ( '• J 2 ^ 1 ~ x) n ~ i ~ 1 9-.(x, t)da\ p(s)ds. (3) 
342 where p(s) is the parameterized distribution of selection coefficients. For gamma distributed fits, 

p(s) 



/3~ Q r(a) 

343 where a and /3 are the shape and rate parameters of the gamma distribution and T(-) is the gamma 

344 function. For a point mass at s, 

p(s) = S(s — s) 

345 where <$(■) is the usual Dirac delta function. 

346 We focused on fitting the shape of the SFS, and hence worked with the probability that a given site 

347 in a sample of n has i copies of the derived allele, 



In ^) 
X/i" = l fn,j(t) 



PnAt) = , 7TV- ( 4 ) 



348 The Mathematica code used for obtaining the theoretical SFS can be found online at: 

349 http: / /malecot. popgen.dk/schraiber/ 

350 Expectation maximization algorithm to fit ancestral state misidentification 

351 rates 

352 We observed that the SFS showed signs of ancestral state misidentification, in particular an excess of 

353 high frequency derived alleles (Figure S2). To account for the ancestral state misidentification errors, 

354 we developed an expectation maximization (EM) algorithm. In the E step, we estimate the posterior 

355 probability that a site at frequency i out of n is misidentified given the current estimated site frequencies, 

356 {Pn\, 1 < i < n}, and the current estimate of the misidentification rate, p^ isl as 

(fe) (fe) fr\ 
m i^PniPmis- ( 5 ) 
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357 Then, during the M step, we re-estimate the misidentification rate as 

n-l 
i=l 

35s where is the number of sites at frequency i. We next re-estimate either the demographic parameters 
359 or the parameters of the selected model using the log-likelihood 



361 Maximum likelihood fitting of demographic models 

362 The exponential growth model has two free parameters, r, the population-scaled growth rate and t, the 

363 total time of exponential growth. We first obtained the site frequency spectrum for all sites with C = 0. 

364 Next we solved g(x, t) for the exponential growth model across a grid of t and r, as well as the constant, 

365 Harris and Tennessen models, and applied our EM algorithm to estimate the best fitting demographic 

366 model, using a grid search over demographic models during the M step. 

367 Maximum likelihood fitting of selection coefficients 

36s To find the maximum likelihood estimate of s for each C-score bin, we first obtained the site frequency 

369 spectrum corresponding to each C-score bin. Next, we solved g{x,t) under the fitted demography for 

370 logio(— s) G [—7,-1.3] in steps of 0.005, along with s = 0. To obtain an estimated SFS under the 

371 assumption of gamma distributed selection coefficients, we used the trapezoid rule to numerically integrate 

372 against a gamma distribution as in formula 3. 

373 We used our EM algorithm to estimate the best fitting selection coefficient for each bin. When fitting 

374 a single coefficient, we used a grid search during the M-step, and when fitting gamma distributed selection 

375 coefficients, we used the L-BFGS-B algorithm. To plot the DFE, we used kernel density estimation with 

376 smoothing bandwith = 0.000005, unless otherwise stated. 



n-l 




(7) 



360 



to obtain the theoretical SFS for the next iteration, {pft 1- , 1 < i < n} 
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377 Mapping using different scores 

378 To test how robust the mapping of C-scores to selection coefficients is to different types of conservation 

379 scores, we obtained PhyloP [48] and PhastCons [49] scores derived from vertebrate, mammal and primate 
3so alignments (excluding humans), as well as GERP++ rejected substitution (Gerp S) scores [50], for all 

381 YR1 SNPs [27]. We attempted to equalize the range of all scores by PHRED-scaling them, i.e. converting 

382 each score to -logio (p) where p is the probability of observing a change as or more disruptive / conserved 

383 (based on that particular score scale) among all polymorphic YR1 sites. We note that this is different 

384 from the natural PHRED scale of C-scores (where p is the the probability of observing a score as or more 

385 disruptive among all possible, but not necessarily realized, mutations in the human genome), and so we 

386 also re-scaled the C-scores to produce a fair comparison. Then, we repeated the maximum likelihood 

387 mapping for each PHRED-scaled score in bins of 0.25 units (e.g. 0-0.125, 0.125-0.375, 0.375-0.625, etc). 

388 It is important to note that PhastCons are regional scores, while PhyloP and Gerp S are position-specific 

389 scores. Another difference is that PhastCons scores only measure the probability of negative selection, 

390 while PhyloP and Gerp S scores also measure positive selection (i.e. low/negative scores represent faster 

391 evolution than expected purely under drift), which may be why we observe an uptick at the lower end of 

392 the mapping for those scores in Figure S9. 
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Figures 



A) Bootstrapped mapping of C-scores to selection coefficients 
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Figure 1. Mapping of C-scores to selection coefficients. A) We first fit a single coefficient to 
each bin using data from all autosomes in the genome. Red dots represent the maximum likelihood 
selection coefficient corresponding to each C-score bin. The blue line is a smooth spline fitted to the 
discrete mappings of C-scores to log-scaled selection coefficients after excluding the neutral bins 
(degrees of freedom = 20). The grey shade is a 95% confidence interval obtained from bootstrapping 
the data 100 times in each bin. B) To verify the shape of the mapping was not a result of the number of 
sites in each bin, we randomized the C-score labels across polymorphisms, and recomputed the 
mapping. C) To account for exonic patterns of conservation and background selection that may not 
have been captured by C-scores, we computed a mapping based solely on exonic sites. D) We fitted 
gamma distributions of selection coefficients to each bin, and computed the mean divided by the 
standard deviation of each distribution, which is indicative of its shape (see Results). E) To test 
whether the gamma distributions were a significantly better fit than the single-coefficient mapping, we 
computed log-likelihood ratio statistics of the two models at each bin. The black line denotes the 
Bonferroni-corrected significance cutoff. F) To test how well we were fitting the data at each bin, we 
computed chi-squared statistics of the fitted SFS to the observed SFS at each bin. The black line 
denotes the Bonferroni-corrected significance cutoff. 
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Figure 2. Distribution of fitness effects among YRI polymorphisms in the Complete 
Genomics dataset, partitioned by the genomic consequence of the mutated site. The right 
panels show a zoomed-in version of the same distributions after removing neutral polymorphisms and 
log-scaling the x-axis. Top row: DFE obtained from genome-wide mapping. Bottom row: DFE 
obtained from exome-wide mapping. Consequences were determined using the Ensembl Variant Effect 
Predictor (v. 2. 5). If more than one consequence existed for a given SNP, that SNP was assigned to the 
most severe of the predicted categories, following the VEP's hierarchy of consequences. NonSyn = 
nonsynonymous. Syn = synonymous. Syn to unpref. codon = synonymous change from a preferred to 
an unpref erred codon. Syn to pref. codon = synonymous change from an unpreferred to a preferred 
codon. Syn no pref. = synonymous change from an unpreferred codon to a codon that is also 
unpreferred. Splice = splice site. CG = Complete Genomics data. 
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Figure 3. Distribution of fitness effects among YRI polymorphisms, partitioned by 
Segway regulatory segments. The right panel shows a zoomed-in version of the same distributions 
after removing neutral polymorphisms and log-scaling the x-axis. 
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Table 1. Characteristics of fitness effect estimated for YRI SNPs classified by different genomic 
consequence, RegulomeDB and Segway categories, using the genome-wide C-to-s mapping. We show 
quantiles of selection coefficients, the log base f 0 of the mean selection coefficient and the log base 10 of 
the standard deviation of coefficients in each category. 



Category 


n 


H < 10~ 5 


|s| > 10~ 5 


|s| > 5*10~ 5 


|s| > 10~ 4 


logio(-s) 


logio(SD(- 


All 


9065398 


73.77% 


26.23% 


0.09% 


0% 


-5.16 


-4.95 


Nonsynonymous 


32522 


28.88% 


71.12% 


2.07% 


0.14% 


-4.6 


-4.74 


Synonymous 


30630 


42.08% 


57.92% 


0.01% 


0% 


-4.81 


-4.87 


Synonymous 4-fold degenerate 


16895 


41.93% 


58.07% 


0% 


0% 


-4.81 


-4.87 


Synonymous to unpreferrcd codon 


12498 


40.01% 


59.99% 


0.01% 


0.01% 


-4.79 


-4.86 


Synonymous to preferred codon 
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38.43% 


61.57% 


0% 


0% 


-4.78 


-4.87 


Synonymous no preference change 


11844 


41.80% 


58.20% 


0.02% 


0% 


-4.81 


-4.87 


Splice site 


10122 


52.11% 


47.89% 


0.05% 


0% 


-4.87 


-4.84 


5' UTR 
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37.53% 


62.47% 


0.11% 


0.02% 


-4.76 


-4.84 


3' UTR 


81605 


49.05% 


50.95% 


0.19% 


0.02% 


-4.88 


-4.87 


Regulatory 


864769 


58.72% 


41.28% 


0.02% 


0% 


-4.99 


-4.92 


Intergenic 


3544204 


77.69% 


22.31% 


0.16% 


0% 


-5.22 


-4.96 


GWAS 


8693 


65.71% 


34.29% 


0.18% 


0% 


-5.04 
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eQTL+TF binding+matched TF mo- 


240 
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0% 
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eQTL+TF binding-^any motif+DNase 
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58.04% 


41.96% 


0.08% 


0% 


-4.97 


-4.9 


peak 
















eQTL+TF binding+matched TF motif 


44 


65.91% 


34.09% 


0% 


0% 


-4.95 


-4.83 


eQTL+TF binding / DNase peak 


26610 


63.87% 


36.13% 


0.06% 


0% 


-5.03 


-4.92 


TF binding+matched TF mo- 


12411 


50.05% 


49.95% 
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0% 
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0% 
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65.45% 


34.55% 
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0% 
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-4.92 


peak 
















TF binding+any motif+DNase peak 
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63.14% 


36.86% 


0.16% 


0% 
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-4.9 


TF binding+matched TF motif 
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73.34% 


26.66% 


0.07% 


0% 


-5.15 


-4.95 


TF binding+DNasc peak 


418548 


59.95% 


40.05% 


0.10% 


0% 


-4.99 


-4.91 


TF binding or DNase peak 


1625195 


69.18% 


30.82% 


0.14% 


0% 


-5.09 


-4.93 


CO - CTCF (strong) 


20813 


57.22% 


42.78% 


0.02% 


0.01% 


-4.98 


-4.94 


CI - CTCF (weak) 


61946 


65.88% 


34.12% 


0.05% 


0% 


-5.08 


-4.96 


D - dead zone 


873933 


81.44% 


18.56% 


0.06% 


0% 


-5.3 


-5 


E/GM - enhancer/gene middle 


70593 


59.27% 


40.73% 


0.06% 


0% 


-4.99 


-4.91 


F0 - FAIRE only 


1177948 


71.94% 


28.06% 


0.11% 


0% 


-5.13 


-4.94 


Fl- FAIRE only 


1481766 


74.45% 


25.55% 


0.10% 


0% 


-5.17 


-4.95 


GEO - gene body (end) 


413390 


67.56% 


32.44% 


0.09% 


0% 


-5.06 


-4.91 


GE1 - gene body (end) 


163365 


63.99% 


36.01% 


0.10% 


0% 


-5.03 


-4.91 


GE2 - gene body (end) 


54261 


69.52% 


30.48% 


0.13% 


0.01% 


-5.1 


-4.93 


GM0 - gene body (middle) 
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64.24% 


35.76% 


0.07% 


0% 


-5.04 


-4.93 


GM1 - gene body (middle) 


101426 


63.21% 


36.79% 


0.07% 


0% 


-5.04 


-4.94 


GS - gene body (start) 


54940 


41.21% 


58.79% 


0.06% 


0.01% 


-4.83 


-4.89 


H3K9mel 


457492 


87.22% 


12.78% 


0.02% 


0% 


-5.46 


-5.07 
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673274 


81.71% 


18.29% 


0.07% 


0% 


-5.31 


-5.01 


LI - low zone 


725876 


70.63% 


29.37% 


0.15% 


0% 


-5.11 


-4.94 


N/A 


53354 


95.36% 


4.64% 


0% 


0% 


-5.77 


-5.29 


R0 - repression 


716710 


74.73% 


25.27% 


0.12% 


0% 


-5.17 


-4.95 


Rl - repression 


335824 


69.73% 


30.27% 


0.09% 


0% 


-5.1 


-4.94 


R2 - repression 


447655 


68.85% 


31.15% 


0.13% 


0% 


-5.08 


-4.91 


R3 - repression 


433156 


75.62% 


24.38% 


0.09% 


0% 


-5.19 


-4.97 


R4 - repression 


205971 


73.26% 


26.74% 


0.06% 


0% 


-5.15 


-4.94 


R5 - repression 


297152 


70.84% 


29.16% 


0.09% 


0.01% 


-5.13 


-4.96 


TF0 - transcription factor activity 


346083 


73.50% 


26.50% 


0.09% 


0% 


-5.17 


-4.97 


TF1 - transcription factor activity 


327695 


77.18% 


22.82% 


0.06% 


0% 


-5.22 


-4.98 


TF2 - transcription factor activity 


127366 


71.09% 


28.91% 


0.12% 


0.01% 


-5.12 


-4.95 


TSS - transcription start site 


25144 


23.16% 


76.84% 


0.12% 


0.02% 


-4.68 


-4.88 
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Supplementary Figures 




Figure SI. Observed SFS for sites under different C-score bins using the Complete 
Genomics YRI data, for all autosomes in the genome (left) and the exome (right). Note 
that the spectrum gets more skewed towards singletons with increasing C-scores, likely reflecting the 
action of negative selection on deleterious mutations. 
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Figure S2. Observed SFS of YRI Complete Genomics data for sites with C=0. The full SFS 
was corrected for ancestral state misidentification using an EM algorithm and fit to different models of 
neutral evolution. We show results for both the genome and the exome. 
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B) Genome mapping vs Exome mapping: fitted gamma SD 
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Figure S3. Features of fitted single-coefficient and gamma distributions. A) Fitted single 
coefficients and means of fitted gamma distributions for each C-score bin, using genome-wide or 
exome-wide polymorphisms. B) Standard deviation of fitted gamma distributions for each bin. C) 
Ancestral misidentification rate obtained from an EM algorithm used to jointly fit the data and infer 
this rate at each bin. SD = standard deviation. 
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C-to-s mapping stratified by B-score deciles 
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C-score 



Figure S4. C-to-s mapping stratified by B-score deciles. We partitioned the genome by deciles 
of B-scores [35] , which reflect levels of background selection. Then, we recomputed the demographic 
fitting and C-to-s mapping for each decile. 
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Figure S5. Inferred DFEs for all, nonsynonymous and synonymous polymorphisms 
obtained from gamma distribution fitting of each C-score bin. The plot shows, for each 
category, a weighted sum of gamma distributions, where each C-score bin contributes its corresponding 
genome-wide best-fitting gamma distribution in proportion to the number of polymorphisms present at 
that bin. 
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Figure S6. Comparison between the size of each C-score bin and the standard deviation of 
single-coefficient fits obtained from 100 bootstraps of the data within each bin. Top panel: 
Standard deviation per C-score bin plotted as a function of sample size per bin (log-scale) . Bottom 
panel: Same plot but with the y-axis on a log-scale. 
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A) C-to-s mapping: all sites vs. low-CpG sites 
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Figure S7. Mapping of sites with low CpG density. A) We filtered for sites with low CpG 
density, such that the proportion of CpG sites in a +/- 75 bp window around each site was < 0.05, and 
then recomputed the C-to-s mapping. B) We also repeated the gamma fitting and calculated a 
liklclihood ratio test of the gamma model against the single-coefficient model at each C-score bin. 
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C-to-s mapping: best-fit demographic model vs. constant-size model 
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Figure S8. Comparison between a C-to-s mapping using the best-fit demographic model 
(exponential growth with t=13,000 and r=l) and a constant-size model. 




Score (common PHRED scale) Score (common PHRED scale) 



Figure S9. Maximum likelihood mapping of different types of scores to a selection 
coefficient scale, excluding bins mapped to neutrality, using the Complete Genomics data. 

Before mapping, scores were re-scaled on a common PHRED scale, by converting each score to -logio(p) 
where p is the probability of observing a change as or more disruptive / conserved (based on that 
particular score scale) among all polymorphic YRI sites. Some scores extend over larger PHRED scores 
than others because they have a finer stratification (smaller number of sites with tied scores) . The wide 
fluctuations to the right of the figures are due to the small number of sites per bin at highly deleterious 
bins. 
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Figure S10. Distribution of fitness effects at nonsynonymous (red), synonymous (blue) 
and all (black) polymorphisms in Yoruba, using different types of conservation scores for 
mapping (smoothing bandwidth = 0.000005). We only mapped sites with PHRED-scaled scores 
< 5, because the mappings become erratic for higher values, due to the small number of sites per bin 
(Figure S9). 
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Figure Sll. Distribution of fitness effects at nonsynonymous (red), synonymous (blue) 
and all (black) polymorphisms in Yoruba, using different types of conservation scores for 
mapping (smoothing bandwidth = 0.000001). We only mapped sites with PHRED-scaled scores 
< 5, because the mappings become erratic for higher values, due to the small number of sites per bin 
(Figure S9). 
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C-scores C-scores (log-scale) 



Figure S12. Distribution of unmapped C-scores among YRI polymorphisms, partitioned 
by the genomic consequence of the mutated site. Consequences were determined using the 
Ensembl Variant Effect Predictor (v. 2. 5). NonSyn = nonsynonymous. Syn = synonymous. Syn to 
unpref. codon = synonymous change from a preferred to an unpreferred codon. Syn to pref. codon = 
synonymous change from an unpreferred to a preferred codon. Syn no pref. = synonymous change from 
an unpreferred codon to a codon that is also unpreferred. Splice = splice site. 
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Figure S13. Distribution of fitness effects among YRI polymorphisms, partitioned by 
whether the SNPs are found in the GWAS database or not. The right panel shows a zoomed-in 
version of the same distributions after removing neutral polymorphisms and log-scaling the x-axis. 
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Table SI. Characteristics of fitness effect distributions estimated for YRf SNPs classified by different 
genomic consequence categories, RegulomeDB categories and Segway categories, using the exome-wide 
C-to-s mapping. We show quantiles of selection coefficients, the log base fO of the mean selection 
coefficient and the log base f 0 of the standard deviation of coefficients in each category. 



Category 


n 


|s| < 10~ 5 


|s| > 10~ 5 


|s| > 5*10~ 5 


|s| > 10~ 4 


logio(-s) 


logio(SD(- 


All 


9065398 


40.51% 


59.49% 


5.87% 


0.92% 


-4.79 


-4.7 


Nonsynonymous 


32522 


17.26% 


82.74% 


45.87% 


11.52% 


-4.28 


-4.35 


Synonymous 


30630 


22% 


78% 


9.77% 


0.06% 


-4.61 


-4.74 


Synonymous 4-fold degenerate 


16895 


22.79% 


77.21% 


10.70% 


0.03% 


-4.61 


-4.73 


Synonymous to unpreferrcd codon 


12498 


21.54% 


78.46% 


10.52% 


0.10% 


-4.59 


-4.73 


Synonymous to preferred codon 


4574 


19.52% 


80.48% 


10.08% 


0% 


-4.59 


-4.75 


Synonymous no preference change 


11844 


22.55% 


77.45% 


9.58% 


0.03% 


-4.61 


-4.74 


Splice site 


10122 


25.43% 


74.57% 


16.08% 


0.42% 


-4.6 


-4.63 


5' UTR 


23125 


13.95% 


86.05% 


18.65% 


0.88% 


-4.52 


-4.55 


3' UTR 


81605 


20.98% 


79.02% 


11.94% 


1.60% 


-4.6 


-4.58 


Regulatory 


864769 


23.20% 


76.80% 


7.42% 


0.14% 


-4.7 


-4.75 


Intergenic 


3544204 


44.21% 


55.79% 


5.35% 


1.12% 


-4.82 


-4.69 


GWAS 


8693 


30.16% 


69.84% 


8.24% 


1.69% 


-4.7 


-4.64 


eQTL+TF binding+matched TF mo- 


240 


14.58% 


85.42% 


11.67% 


0.83% 


-4.6 


-4.7 


tif+matched DNase Footprint+DNase 
















peak 
















eQTL+TF binding+any motif+DNase 


1965 


18.78% 


81.22% 


9.97% 


0.51% 


-4.64 


-4.69 


Footprint—DNasc peak 
















eQTL+TF binding+matched TF mo- 


62 


25.81% 


74.19% 


8.06% 


1.61% 


-4.69 


-4.65 


tif— DNase peak 
















eQTL+TF binding+any motif+DNase 


1263 


22.33% 


77.67% 


9.11% 


0.55% 


-4.67 


-4.7 


peak 
















eQTL+TF binding+matched TF motif 


44 


31.82% 


68.18% 


13.64% 


2.27% 


-4.62 


-4.56 


eQTL+TF binding / DNase peak 


26610 


27.75% 


72.25% 


7.16% 


0.86% 


-4.71 


-4.7 


TF binding+matched TF mo- 


12411 


17.94% 


82.06% 


13.42% 


0.54% 


-4.6 


-4.65 


tif+matched DNase Footprint+DNase 
















peak 
















TF binding+any motif+DNase Foot- 


120642 


22.76% 


77.24% 


9.62% 


0.82% 


-4.66 


-4.66 


print+DNase peak 
















TF binding+matched TF motif+DNase 


5355 


30.51% 


69.49% 


7.79% 


1.05% 


-4.72 


-4.68 


peak 
















TF binding+any motif+DNase peak 


96905 


28.49% 


71.51% 


8.61% 


1.29% 


-4.69 


-4.64 


TF binding+matched TF motif 


4359 


38.63% 


61.37% 


6.33% 


0.99% 


-4.78 


-4.7 


TF binding+DNasc peak 


418548 


24.64% 


75.36% 


8.15% 


0.87% 


-4.68 


-4.68 


TF binding or DNase peak 


1625195 


33.43% 


66.57% 


7.02% 


1.32% 


-4.74 


-4.66 


CO - CTCF (strong) 


20813 


18.32% 


81.68% 


6.75% 


0.15% 


-4.69 


-4.77 


CI - CTCF (weak) 


61946 


27.65% 


72.35% 


5.21% 


0.46% 


-4.75 


-4.77 


D - dead zone 


873933 


52.91% 


47.09% 


4.24% 


0.65% 


-4.89 


-4.75 


E/GM - enhancer/gene middle 


70593 


23.64% 


76.36% 


7.94% 


0.67% 


-4.68 


-4.7 


FO - FAIRE only 


1177948 


36.88% 


63.12% 


6.41% 


1.04% 


-4.76 


-4.69 


Fl- FAIRE only 


1481766 


41.02% 


58.98% 


5.94% 


1.03% 


-4.79 


-4.69 


GEO - gene body (end) 


413390 


33.96% 


66.04% 


8.03% 


1.09% 


-4.72 


-4.66 


GE1 - gene body (end) 


163365 


28.37% 


71.63% 


8.06% 


1.03% 


-4.7 


-4.67 


GE2 - gene body (end) 


54261 


32.46% 


67.54% 


6.38% 


0.94% 


-4.74 


-4.66 


GMO - gene body (middle) 


130000 


27.23% 


72.77% 


7.27% 


0.82% 


-4.71 


-4.7 


GM1 - gene body (middle) 


101426 


24.46% 


75.54% 


6.44% 


0.62% 


-4.71 


-4.73 


GS - gene body (start) 


54940 


10.49% 


89.51% 


11.35% 


0.45% 


-4.58 


-4.7 


H3K9mel 


457492 


62.34% 


37.66% 


2.70% 


0.27% 


-4.98 


-4.85 


L0 - low zone 


673274 


52.14% 


47.86% 


4.02% 


0.67% 


-4.89 


-4.76 


LI - low zone 


725876 


33.64% 


66.36% 


6.32% 


1.24% 


-4.75 


-4.68 


N/A 


53354 


35.18% 


64.82% 


0.87% 


0.07% 


-4.96 


-5.09 


R0 - repression 


716710 


42.48% 


57.52% 


6.18% 


1.11% 


-4.79 


-4.68 


Rl - repression 


335824 


32.41% 


67.59% 


6.33% 


1.01% 


-4.75 


-4.69 


R2 - repression 


447655 


34.45% 


65.55% 


7.83% 


1.37% 


-4.73 


-4.65 


R3 - repression 


433156 


41.40% 


58.60% 


5.10% 


0.85% 


-4.81 


-4.72 


R4 - repression 


205971 


41.10% 


58.90% 


6.35% 


0.80% 


-4.78 


-4.7 


R5 - repression 


297152 


31.70% 


68.30% 


5.15% 


0.66% 


-4.77 


-4.74 


TFO - transcription factor activity 


346083 


35.10% 


64.90% 


4.81% 


0.69% 


-4.79 


-4.75 


TF1 - transcription factor activity 


327695 


45.07% 


54.93% 


4.85% 


0.65% 


-4.83 


-4.74 


TF2 - transcription factor activity 


127366 


33.95% 


66.05% 


5.69% 


0.82% 


-4.76 


-4.68 


TSS - transcription start site 


25144 


5.01% 


94.99% 


21.84% 


0.89% 


-4.46 


-4.6 
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